We would like to determine if tumors impact on the efficacy of myeloid cell development. For that purpose we are using the spontaneous breast cancer mouse model MMTV-PyMT. MMTV-PyMT is spontaneous mouse model of breast cancer. Genetically engineered female mice express the tumor oncogene PyMT under the MMTV promoter, leading to progressive (hyperplasia, adenoma, early and late carcinoma) tumor development in mammary glands. In the C57/Bl6 background, female developed palpable tumor around the age of 14w, and have a maximal tumor burden between the age of 18w to 25w old. Our preliminary results show an increase #hematopoiesis in tumor bearing mice with elevated numbers of early hematopoietic progenitors in the bone marrow of tumor bearing mice compared to littermate #control. In periphery (blood, spleen, tumor) we also find increased numbers of myeloid cells (monocyte, granulocytes, dendritic cells). Taken together these data suggest that cancer development impact on myelopoiesis in the bone marrow leading to increase production of myeloid cells in the periphery.To better understand the impact of cancer on hematopoiesis we are planning to perform single cell RNA sequencing on the HSPCs on tumor bearing mice compared to tumor free mice.

Cells from each condition were sequenced on the 10X Genomics platform, a droplet based approach to isolate single-cells for sequencing. As described in AlJahani et al (2018): “Droplet-based single-cell gene expression approaches use microfluidic chips to isolate single cells along with single beads in oil-encapsulated droplets, using microfluidics to bring oil, beads, and cell suspensions together in such a way that each droplet contains at most a single cell.20 The beads are coated with DNA oligos that are composed of a poly(T) tail at the 3′ end for the capture of cellular mRNAs, and at the 5′ end both a cell barcode that is identical for every oligo coating an individual bead and a library of individual unique molecular identifier (UMI) barcodes of high diversity, each UMI different for every oligo on the bead. The transcripts from each individual cell captured and labeled by the DNA oligos attached to a bead within the droplets are reverse transcribed, amplified with PCR, and sequenced using a high-throughput platform, after breaking and pooling droplet contents. The resulting sequences are aligned to a reference genome in order to annotate each transcript with its gene name. The cell barcodes on the aligned sequences allow for the computational linking of each gene transcript to its cell of origin. The number of copies of individual gene transcripts expressed in each individual cell is tallied using the UMIs, allowing the assembly of digital gene expression matrices (DGEs), which are tables of cell barcodes and gene counts.”

MMTV: mouse mammary tumor virus PyMT: Polyomavirus middle T antigen

Step 1: Prepare the workspace, loading the appropriate packages and reading the data in with associated metadata

#clear the workspace
rm(list=ls())

set.seed(12345)

#set the working directory and load in required libraries
setwd("/Users/jasoncosgrove/Dropbox (Team_Perie)/Jason/Experiments/Dry_Lab/JCB7_PyMT/scRNAseq_PyMT/scRNAseq_analysis/")
library(Seurat)
library(edgeR)
library(calibrate)
library(clustree)
library(dplyr)
source("helper_methods.R")
library(destiny)
library(dplyr)
library(ggrepel)
library(plotrix)
library(scales)
library(enrichR)

#color blind friendly color palette for the analysis
cbPalette <- c("light grey","#0072B2",
        "plum","orange","cornflowerblue", "darkolivegreen3", "#b19cd9")

#color palette when comparing PyMT and WT
cbPalette2 <- c(rgb(red = 169/255,green = 169/255,blue = 169/255, alpha = 1),rgb(67/255,138/255,201/255,1))

# The folder where you would like to store the results
output.path <- "/Users/jasoncosgrove/Dropbox (Team_Perie)/Jason/Experiments/Dry_Lab/JCB7_PyMT/scRNAseq_PyMT/scRNAseq_analysis/results/"

Load in the cellRanger outputs and convert into a Seurat object

lsks <- makeSeuratObject()

Step 2: Data QC

To assess the quality of the data we assess the library sizes, numbers of genes expressed and mitochondrial content per cell. Cells which have very high library sizes or relative to other cells in the data may represent doublet cells and so are filtered out. Cells with very low library sizes are typically because of poor capture quality pontentially due to cell death, premature rupture, or capture of random mRNA escaping from cells, consequently cells with low library sizes are also filtered out from downstream analyses.

Another important QC metric is mitochondrial content. As discussed in AlJanahi et al (2018) “High numbers of mitochondrial transcripts are indicators of cell stress, and therefore cells with elevated mitochondrial gene expression are often not included in the analysis, because most experiments will not benefit from clustering cells based on stress levels. However, just as with number of transcripts, this parameter is highly dependent on the tissue type and the questions being investigated. For example, 30% of total mRNA in the heart is mitochondrial due to high energy needs of cardiomyocytes, compared with 5% or less in tissues with low energy demands. For instance, 30% mitochondrial mRNA is representative of a healthy heart muscle cell, but would represent a stressed lymphocyte.”

# find the percentage of mitochondrial genes
lsks<- PercentageFeatureSet(object = lsks, pattern = "^mt-", col.name = "percent.mito")

# diagnostic plots
VlnPlot(object = lsks, features= c("nFeature_RNA", "nCount_RNA", "percent.mito"))
p1 <- FeatureScatter(object = lsks, feature1 = "nCount_RNA", feature2 = "percent.mito")
p2 <- FeatureScatter(object = lsks, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
patchwork::wrap_plots(p1,p2, ncol = 2)

# based on our plots lets filter cells above a given threshold 
lsks <- subset(x = lsks, subset = nCount_RNA > 1000 & nCount_RNA < 20000 & nFeature_RNA > 1000 
               & nFeature_RNA < 5000 & percent.mito < 4.5 & percent.mito > 1)

#lets visualise the data again to see the effects of our filtering
VlnPlot(object = lsks, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"))

Step3: assign cell cycle phase to facilitate downstream analyses.

To assign a cell cycle phase to each cell, we use the cyclone method in the R package scran. In this scheme, a supervised learning approach was used to identify pairs of markers for each cell cycle phase. A G1 marker pair would comprise a gene with high expression in G1 relative to other phases, while the second gene would be lower in G1 relative to all other phases. To classify cell cycle phase on a new dataset, cyclone calculates the proportion of all marker pairs for which the expression of the first gene is higher than the second gene. A high proportion then suggests that the cell belongs to a given cell cycle phase.

# lets assign cell cycle phase using the cell cycle markers from Tirosh et al, 2015
# the method is in the scran package and was from John Marionis/Florian Buettner in 
# the Scialdone (2015) paper.  
lsks <- runCellCycleAnnotation(lsks)

  par(lwd = 2)
barplot(as.matrix(table(lsks@meta.data$phases)),beside = FALSE, col = cbPalette,legend = TRUE,
        args.legend = list(x = "bottomright",
                           inset = c(- .25, 0)))

axis(side = 2, lwd = 2)

# save the seurat object to file
saveRDS(lsks,file = "seuratObjects/seuratobjectAfterQC.Rda")

Step 4 Data Normalisation: assign cell cycle phase to facilitate downstream analyses.

When analyzing sequencing data, normalization to eliminate batch effects is crucial if multiple sequencing runs are to be compared with each other. These batch effects can be caused by often unavoidable technical variations such as the duration samples were kept on ice, number of freeze-thaw cycles, method of RNA isolation, sequencing depth, etc.

An additional consideration is that droplet-based sequencing in addition consists of thousands of individual cell experiments, hence cell-specific biases must also be considered when normalizing, in order to be able to compare the expression of one cell to another. A notable cell-specific bias is caused by mRNA capture efficiency, where the mRNA molecules are not captured by the bead at the same proportion in all droplets. As individual cells are not all of the same type a key consideration is how to retain cell to cell variability while eliminating technical noise.

To normalise our data we use 2 methods, the default seurat method and sctransform. For the default method feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. This is then natural-log transformed using log1p. For scTransform, in this approach the normalized values are Pearson residuals from regularized negative binomial regression, where cellular sequencing depth is used as a covariate. To scale the data, we center it around zero for each gene. The scaled data is used just for generating visualizations of the data, while the normalized matrix is used for statistical comparisons such as differential expression. In their preprint they show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates.

lsks <- readRDS("seuratObjects/seuratobjectAfterQC.Rda")
lsks <- NormalizeData(object = lsks)
lsks <- ScaleData(object = lsks)

Step 5 Variable Feature Selection

To find variably expressed genes of interest to take forward for further analysis we use Seurats vst method (figure 4). Briefly, this approach models the relationship between log mean expression and log variance using local polynomial regression. The features values are then standardized using the observed mean and predicted variance, with the final variance value calculated on the standardized values.

lsks <- FindVariableFeatures(lsks, selection.method = "vst", nfeatures = 5000) 
VariableFeaturePlot(lsks)

Step 6 Dimensionality Reduction

We perform dimensionality reduction on 2500 variably expressed genes using both principle component analysis, an approach to find the linear combination of genes that are the greatest source of variance in the data, and independent component analysis, a signal processing method designed to separate different signals (in this context a signal is a biological process) that are linearly mixed together. Informally, if you had a smoothie, ICA can tell you what the ingredients are.

We visualize our data using the non-linear dimensionality reduction technique UMAP. This approach is analogous to PCA, but can also identify non-linear patterns in the data. The goal of the algorithm is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. UMAP is preferable to t-SNE as it is faster to compute, and uses a graph based approach which permits the organisation clusters into a more biologically accurate representation than t-SNE. Importantly, we use the first 10 independent components as inputs into the UMAP algorithm, this reduces noise and when compared to PCA tended to group cells by established descriptions rather than by cell cycle status, although there was still a dominant cell cycle effect.

#Re-level the condition value so WT comes before the PyMT
lsks@meta.data$condition <- factor(lsks@meta.data$condition,levels = c("WT", "PyMT"))


lsks <- RunPCA(object = lsks, verbose = FALSE,npcs = 50)
ElbowPlot(lsks,ndims = 50)

DimPlot(lsks, reduction = "pca", group.by = "condition",cols =  cbPalette2)

DimPlot(lsks, reduction = "pca", group.by = "phases",cols = c("light grey","orange","#556b2f"))

lsks <- RunUMAP(object = lsks, dims = 1:20, verbose = FALSE,reduction = "pca")

#Overlay cell cycle information onto the UMAP visualisation
DimPlot(object = lsks, label = TRUE,reduction = "umap", 
        pt.size = 1,group.by= "phases", cols = c("light grey","orange","#556b2f"))

#Overlay experimental condition information onto the UMAP visualisation
DimPlot(object = lsks, label = F,reduction = "umap",
        pt.size = 1, group.by = "condition",
        cols = cbPalette2,split.by = "condition") 

#Overlay mouse-specific information onto the UMAP visualisation
DimPlot(object = lsks, label = F,reduction = "umap", pt.size = 1, 
        group.by = "mouse",cols = cbPalette, split.by = "mouse") 

Step 7 Downsampling to deal with unbalanced experimental design

Now that the initial preprocessing is done we divide our dataset into pymt and wt components to facilitate some downstream analyses.

Idents(lsks) <- lsks@meta.data$condition
#we also split the orginal object into WT and PyMT
wt <- subset(lsks, idents = c("WT"))
pymt <- subset(lsks, idents = c("PyMT"))

# as we have an unbalanced study design, that is more cells in
# the PyMT than the WT we downsample the pymt. This can be useful 
# for visualisation based approaches where having more cells 
# might bias the interpretation of the data
randomly.sampled.cells <- sample(colnames(pymt),size = ncol(wt))
pymt.downsampled <- subset(pymt, cells = randomly.sampled.cells)
lsks.downsampled <- subset(lsks, cells = c(colnames(wt), colnames(pymt.downsampled)))

Step 8 Density Map

overlay the density of pymt and wt lsks onto our existing UMAP to see if they localise differently.Thanks to Wilfrid Richer and Polina Pavlovich for this code

tiff(paste(output.path,"DensityUMAP_PyMT.tiff",sep = ""), width = 4, 
     height = 4, units = 'in', res = 300)
createDensityPlot(pymt)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"DensityUMAP_WT.tiff",sep = ""), width = 4, 
     height = 4, units = 'in', res = 300)
createDensityPlot(wt)
dev.off()
## quartz_off_screen 
##                 2

Step 9 Unsupervised Clustering

To look at whether PyMT treatment affects a specific subcompartment of LSK cells we perform clustering using Seurats default approach. Briefly, this approach involves embedding cells in a graph structure such as a K-nearest neighbour graph, with edges drawn between cells with similar feature expression patterns, and then attempts to partition this graph into a number of highly interconnected subgroups. As LSK cells do not form discrete clusters, but rather show a smooth continuum of expression, our clustering results were highly sensitive to the resolution parameter of Seurats clustering algorithm.

lsks <- FindNeighbors(object = lsks, 
                      dims = 1:10,
                      reduction = "pca",
                      verbose = FALSE,force.recalc = TRUE)


for(i in seq(from=0.1, to=1.0, by=0.1)){
    lsks <- FindClusters(object = lsks, 
    resolution = i,verbose = FALSE)
  }

tiff(paste(output.path,"clustree.tiff",sep = ""), width = 6, height = 8, units = 'in', res = 300)
clustree(lsks)
dev.off()
## quartz_off_screen 
##                 2
#now pick a resolution and plot it
lsks <- FindClusters(object = lsks, resolution = 0.4,verbose = FALSE)

Step 10 Distribution of PyMT and WT amongst the different clusters

#give the cluster a name and re-order the clusters
lsks <- RenameIdents(object = lsks, 
                       `0` = "MPP3",
                       `1` = "WT-HSC",
                       `2` = "Cycling",
                       `3` = "PyMT-HSC",
                       `4` = "MPP2",
                       `5` = "MPP2_1",
                       `6` = "MPP4")


lsks@active.ident <- factor(lsks@active.ident,
                            levels = c("WT-HSC",
                                       "PyMT-HSC",
                                       "MPP2",
                                       "MPP2_1",
                                       "MPP3",
                                       "MPP4",
                                       "Cycling"))

#Upon revising the data, Yohan and Julie prefer to keep numbers rather than names for clusters. Here we just rename the clusters as numbers, keeping the same ordering as before. Once the pipeline is fully finalised following reviewers comments we will refactor this part
lsks <- RenameIdents(object = lsks, 
                        "WT-HSC" = '0',
                        "PyMT-HSC" = '1',
                        "MPP2" = '2',
                        "MPP2_1" = '3',
                        "MPP3" = '4',
                        "MPP4"= '5',
                        "Cycling" = '6')


#save the clustering information to the metadata slot. 
lsks@meta.data$seurat_clusters <- lsks@active.ident


tiff(paste(output.path,"UMAP.tiff",sep = ""), width = 5, height = 4, units = 'in', res = 300)
DimPlot(object = lsks, label = FALSE,reduction = "umap", 
        pt.size = 0.1,cols = cbPalette) 
dev.off()
## quartz_off_screen 
##                 2
df <- (table( lsks@active.ident,lsks@meta.data$condition))
opar <- par(lwd = 2)



#look at the abundance of WT and PyMT amongst the difference clusters
tiff(paste(output.path,"clusterabundance.tiff",sep = ""), width = 4, height = 4, units = 'in', res = 300)
barplot(df, col =  cbPalette,legend = TRUE,
        args.legend = list(x = "bottomright",
                           inset = c(- .0, 0.05)))
axis(side = 2, lwd = 2)
dev.off()
## quartz_off_screen 
##                 2
#look at the abundance of cycling cells amongst the difference clusters
tiff(paste(output.path,"clusterabundance_FINAL.tiff",sep = ""), width = 4, height = 4, units = 'in', res = 300)
barplot(t(t(df) /colSums(df)),col = cbPalette)

axis(side = 2, lwd = 2)
dev.off()
## quartz_off_screen 
##                 2

Step 11 Celltype Annotation

After fixing the resolution parameter, we then see which clusters overlap with previously described transcriptomic signatures. These signatures, along with associated references, are provided in a supplementary excel file.

gene.sets <- read.csv('genesets/genesets.csv')
gene.set.names <- colnames(gene.sets)
lsks <- geneSignatureScoring(lsks, gene.sets, gene.set.names,assay = "RNA")


tiff(paste(output.path,"sig_scores.tiff",sep = ""), width = 8, 
     height = 6, units = 'in', res = 300)

VlnPlot(lsks, features = c("WilsonMolO1",
                           "Pia_MPP11",
                           "Pia_MPP21",
                           "Pia_MPP31",
                           "Pia_MPP41",
                           "Pia_MPP51"),
        pt.size = 0,group.by = 'condition',
        cols = cbPalette2) 

dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"sig_scores1.tiff",sep = ""), width = 8, 
     height = 6, units = 'in', res = 300)

VlnPlot(lsks, features = c("WilsonMolO1",
                           "Pia_MPP11",
                           "Pia_MPP21",
                           "Pia_MPP31",
                           "Pia_MPP41",
                            "Pia_MPP51"),
        pt.size = 0,cols =  cbPalette)

dev.off()
## quartz_off_screen 
##                 2
#In this set of plots, Yohan and Julie have asked for violin pots of very specific genes, we plot them here. 

tiff(paste(output.path,"sig_scoresGENES.tiff",sep = ""), width = 8, 
     height = 6, units = 'in', res = 300)

VlnPlot(lsks, features = c("Meis1","Myl10","Gata1","Elane"),
        pt.size = 0,cols =  cbPalette,ncol = 2)

dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"sig_scoresGENES2.tiff",sep = ""), width = 8, 
     height = 6, units = 'in', res = 300)

VlnPlot(lsks, features = c("Mpo","Dntt","Mki67","Top2a"),
        pt.size = 0,cols =  cbPalette,ncol = 2)

dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"wilsonMolO.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "WilsonMolO1", min.cutoff = "q5", 
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"MPP1.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "Pia_MPP11", min.cutoff = "q5", 
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"MPP2.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "Pia_MPP21", min.cutoff = "q5", 
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"MPP3.tiff",sep = ""), width = 4, 
     height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "Pia_MPP31", min.cutoff = "q5", 
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"MPP4.tiff",sep = ""),
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "Pia_MPP41", min.cutoff = "q5", 
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"MPP5.tiff",sep = ""),
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(lsks, features = "Pia_MPP51", min.cutoff = "q5",
            max.cutoff = "q95",pt.size = 1,cols = c("grey20",'green'))
dev.off()
## quartz_off_screen 
##                 2

Here we plot some additional UMAP and violin plot based visualisations.

lsks@meta.data$seurat_clusters <- lsks@active.ident


tiff(paste(output.path,"UMAP.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
DimPlot(lsks,cols = cbPalette)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"UMAP_split.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
DimPlot(lsks,cols = cbPalette2, split.by = "condition",group.by = "condition")
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"sig_scores3.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
VlnPlot(lsks, features = c("WilsonMolO1",
                           "Pia_MPP11",
                           "Pia_MPP21",
                           "Pia_MPP31",
                           "Pia_MPP41",
                           "Pia_MPP51"),
        pt.size = 0,
        cols = cbPalette)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"clusters_annotated_barplot.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
df <- (table( lsks@active.ident,lsks@meta.data$condition))
opar <- par(lwd = 2)
barplot(t(t(df) /colSums(df)),legend =TRUE, col = cbPalette)
axis(side = 2, lwd = 2)
dev.off()
## quartz_off_screen 
##                 2
saveRDS(lsks,file = "seuratObjects/seuratobjectAfterClusteringAnnotation.Rda")

Step 12 Identification of Differentially Expressed Genes Between Clusters

To understand between key differences between our clusters, we perform a differential expression analysis. It is important to note that different approaches for differential expression analysis of single cells rely on different assumptions about that data, and consequently can give very different results5. To address this limitation we use two different, but complementary, approaches: edgeR and logistic regression.

In the approach below we employ a logistic regression framework to determine differentially expressed genes. Specifically, we construct a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test. This approach is advantageous for the analysis of HSPCs which have an expression profile that is distinct from mature cell subsets. Typically, expression profiles are bimodal, and changes in the magnitude of expression are subtle. In other datasets we have noted that cells have similar expression magnitudes but that different proportions of cells are positive for a given gene, within a given group. Given these unique features of our data, we posit that logistic regression is well suited to performing differential expression analysis of our data.

markers <-  FindAllMarkers(lsks,only.pos = TRUE,test.use = "LR",logfc.threshold = 0.1)
x <-markers %>% group_by(cluster) %>% top_n(8, avg_log2FC)
print(tbl_df(x), n=40)
## # A tibble: 56 × 7
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
##  1 0              0.951 0.977 0.869 0         0       Car2     
##  2 3.28e-266      0.976 0.55  0.271 3.94e-262 0       Hlf      
##  3 5.50e-218      1.09  0.534 0.25  6.59e-214 0       Ltb      
##  4 4.35e-203      0.816 0.726 0.525 5.22e-199 0       Gimap1   
##  5 1.87e-201      0.837 0.598 0.41  2.24e-197 0       Lst1     
##  6 1.40e-192      0.861 0.569 0.368 1.67e-188 0       Txnip    
##  7 5.87e- 76      0.881 0.121 0.033 7.04e- 72 0       Meg3     
##  8 6.07e- 51      0.863 0.202 0.123 7.28e- 47 0       Cd74     
##  9 0              1.22  0.894 0.827 0         1       Pim1     
## 10 0              0.952 0.99  0.953 0         1       Ifitm1   
## 11 6.23e-222      1.22  0.734 0.668 7.47e-218 1       Junb     
## 12 2.22e-144      0.857 0.829 0.842 2.67e-140 1       Jund     
## 13 9.18e-136      0.863 0.285 0.148 1.10e-131 1       Socs3    
## 14 1.05e-110      0.947 0.422 0.276 1.25e-106 1       Ltb      
## 15 7.00e- 67      1.12  0.557 0.522 8.40e- 63 1       Wfdc17   
## 16 2.18e- 20      0.839 0.251 0.282 2.61e- 16 1       Hist1h2bc
## 17 0              1.09  0.913 0.56  0         2       Tyms     
## 18 0              1.04  0.899 0.509 0         2       Lig1     
## 19 0              1.03  0.793 0.379 0         2       Tk1      
## 20 0              0.900 0.956 0.673 0         2       Pcna     
## 21 0              0.762 1     0.935 0         2       Tuba1b   
## 22 1.42e-257      0.946 0.761 0.388 1.70e-253 2       Rrm2     
## 23 6.13e-228      0.736 0.784 0.44  7.35e-224 2       Hells    
## 24 2.77e-174      0.759 0.683 0.37  3.32e-170 2       Ung      
## 25 0              3.61  0.556 0.122 0         3       Pf4      
## 26 0              2.38  0.989 0.619 0         3       Apoe     
## 27 0              2.13  0.842 0.213 0         3       Itga2b   
## 28 0              1.72  0.82  0.247 0         3       Pbx1     
## 29 0              1.58  0.666 0.043 0         3       Gp5      
## 30 0              1.47  0.866 0.686 0         3       Rap1b    
## 31 0              1.44  0.859 0.463 0         3       Cd9      
## 32 0              1.43  0.625 0.037 0         3       Gata1    
## 33 0              3.56  0.797 0.374 0         4       Elane    
## 34 0              2.57  0.992 0.833 0         4       Mpo      
## 35 0              2.45  0.952 0.513 0         4       Ctsg     
## 36 0              1.57  0.451 0.166 0         4       Ms4a3    
## 37 0              1.47  0.929 0.602 0         4       Lgals1   
## 38 0              1.39  0.576 0.074 0         4       Hp       
## 39 0              1.23  1     0.934 0         4       Plac8    
## 40 0              1.19  0.928 0.504 0         4       Sdf2l1   
## # … with 16 more rows
write.csv(markers, file = paste(output.path,"cluster_markers.csv"))


tiff(paste(output.path,"heatmap.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
DoHeatmap(lsks, features = x$gene,disp.min = -1.5,
          disp.max = 1.5,label = F,group.colors = cbPalette) 
dev.off()
## quartz_off_screen 
##                 2

Are any of our clusters enriched for a given cell type or cycling cells?

df <- table( lsks@meta.data$condition,lsks@active.ident)


tiff(paste(output.path,"barplot_clusterenrichment.tiff",sep = ""), width = 8, height = 6, units = 'in', res = 300)
opar <- par(lwd = 2)
barplot(t(t(df) /colSums(df)), legend = T,col = cbPalette2, 
        args.legend = list(x = "bottomright", inset = c(- 0.0, 0.05)))
axis(side = 2, lwd = 2)
dev.off()
## quartz_off_screen 
##                 2
Idents(lsks) <- lsks@meta.data$seurat_clusters
df <- table( lsks@meta.data$phases,lsks@active.ident)
barplot(table( lsks@meta.data$phases,lsks@active.ident), legend = T, col = cbPalette,args.legend = list(x = "bottomleft", inset = c(- 0.0, 0.05)))

opar <- par(lwd = 2)

# same plot as above but normalised
barplot(t(t(df) /colSums(df)), legend = T,col = cbPalette,args.legend = list(x = "bottomleft", inset = c(- 0.0, 0.05)))
axis(side = 2, lwd = 2)

tiff(paste(output.path,"barplot_cyclingenrichment.tiff",sep = ""), width = 4, height = 6, units = 'in', res = 300)
opar <- par(lwd = 2)
barplot(t(t(df) /colSums(df)), col = cbPalette,legend = TRUE,
        args.legend = list(x = "bottomleft",
                           inset = c(- 0.0, 0.05)))
axis(side = 2, lwd = 2)
dev.off()
## quartz_off_screen 
##                 2

Step 13 look at differentially expressed genes between experimental conditions

gene.list <- c('Jund','Junb','Ifitm1','Cebpb','Elane','Ms4a6c',
                'Mpo','Ifitm2','Ifitm3','Ctsg','Pf4','Car2','Meis1','Kit',
                'Igf1r','Cd24a','Mllt3','Pbx1','Gata2','Mecom','Cdc42','Ndufa12',
                'Cox6a1','Atp5o')
  
  
  
#generate a volcano plot to visualise DEGs
volcanoPlotHSPC <- function(res,gene.list){
  res$gene <- rownames(res)
  
  res$sign <- 0
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC > 0.1)] <- 2
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC < -0.1)] <- 1
  res$p_val_adj[res$p_val_adj == 0] <- 1e-312
  
  p <- ggplot(data=res, aes(x=avg_log2FC, y=-log10(p_val_adj), colour=as.factor(sign))) + geom_point( size=1) +
    scale_color_manual(name="", values=c("2" = rgb(67/255,138/255,201/255,1),"1"="#999998", "0"=rgb(220/255,220/255, 220/255,0.2))) +  
    theme(legend.position = "none") + xlim(-1.4,1.4) + 
    xlab("log2 fold change") + ylab("-log10 pvalue") + 
    geom_vline(xintercept=c(-0.1, 0.1), linetype=2) + 
    geom_hline(yintercept=-log10(0.05), linetype=2) 
  
  
  
  p <- p  + theme(
    text = element_text(size = 20),
    # Remove panel border
    panel.border = element_blank(),  
    # Remove panel grid lines
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Remove panel background
    panel.background = element_blank(),
    # Add axis line
    axis.line = element_line(colour = "grey")
  ) +geom_text_repel(data=res[gene.list,],aes(label=gene),size = 5,force = 2,force_pull = 5,color = "black",min.segment.length = unit(0, 'lines'), 
                     nudge_y = .2)
  
  
  
  return(p)
  
}




#Volcano plot of DEGs across conditions
tiff(paste(output.path,"HSPCvolcanoPlot.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
Idents(lsks) <- lsks@meta.data$condition
markers <-  FindMarkers(lsks,ident.1 = "PyMT",logfc.threshold = 0.001,
                        test.use = "LR",min.cells.feature = 100)

write.csv(markers, file = paste(output.path,"HSPC_PyMTvsWT_markers.csv"))
volcanoPlotHSPC(markers,gene.list)
dev.off()
## quartz_off_screen 
##                 2

Step 14 look at differentially expressed genes controlling for cell cycle

Another approach to look for differentially expressed genes is to use using the edgeR package, a process which involves fitting negative binomial generalized log-linear model to the read counts for each gene. This approach is advantageous as (a) it using a negative binomial distribution rather than assuming normality which is not appropriate for single cell data; (b) it permits the use of covariates in differential expression analysis (e.g batch effects, differences in mitochondrial content, or cell cycle phase); (c) the approach performs well in single-cell benchmarking studies. As PCA and ICA show that there is a dominant cell cycle effect in our dataset we add cycling status (described further in section 2) as a covariate to our analysis to regress this feature out.

condition <- relevel(factor(lsks@meta.data$condition), ref="WT")
phases <- factor(lsks@meta.data$phases)

e <- DGEList(counts=lsks@assays$RNA@data)
e$samples$group <- condition
e$samples$condition <- condition
edesign <- model.matrix(~ phases + condition) # we dont have a contrast matrix here so we should leave out the zero
e <- calcNormFactors(e, method = "none") # the data has already been normalised in seurat using the scTransform method, so we dont need to redo it. 

e <- estimateDisp(e, edesign)
efit <- glmQLFit(e,edesign) 
efit <- glmQLFTest(efit,coef = "conditionPyMT")

etable <- topTags(efit, n=nrow(e))$table
etable <- etable[order(etable$FDR), ]

#do a volcano plot of the results
res <- cbind.data.frame(etable$logFC, etable$FDR,rownames(etable))
colnames(res) <- c("log2FoldChange", "pvalue","gene")
res[res$pvalue == 0,2] <- 1e-300
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-1.5,1.5)))
with(subset(res, pvalue<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="black"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, pvalue<.05 & abs(log2FoldChange)>0.5), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, pvalue<.05 & abs(log2FoldChange)>0.5), textxy(log2FoldChange, -log10(pvalue), labs=gene, cex=.6,offset = 0.8))


write.csv(etable, file = paste(output.path,"HSPC_PyMTvsWT_markers_cellcycleregression.csv"))

Step 15 Gene set enrichment analysis using EnrichR

More information about the EnrichR algorithm can be found at the following publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-128 In short it uses an adaptation of the Fishers exact test, where they look at the enrichment scores of randomised sets of gene-sets to build a null distribution and then test for significance of real gene-sets using this reference distribution

dbs <- listEnrichrDbs()

Idents(lsks) <- lsks@meta.data$seurat_clusters
HSCs <- subset(lsks, idents = c("PyMT-HSC","WT-HSC"))
MPP3 <- subset(lsks, idents = c("MPP3"))
Idents(lsks) <- lsks@meta.data$condition
Idents(HSCs) <- HSCs@meta.data$condition
Idents(MPP3) <- MPP3@meta.data$condition


tiff(paste(output.path,"pathwaysHSPCs.tiff",sep = ""), width = 12, height = 6, units = 'in', res = 300)
DEenrichRPlot(
  lsks,
  ident.1 = "PyMT",
  balanced = FALSE,
  logfc.threshold = 0.1,
  assay = NULL,
  max.genes = 2000,
  test.use = "LR",
  p.val.cutoff = 0.05,
  cols = NULL,
  enrich.database = c("GO_Biological_Process_2018"),
  num.pathway = 25,
  return.gene.list = FALSE
)
dev.off()

tiff(paste(output.path,"pathwaysHSCs.tiff",sep = ""), width = 12, height = 6, units = 'in', res = 300)
DEenrichRPlot(
  HSCs,
  ident.1 = "PyMT",
  balanced = FALSE,
  logfc.threshold = 0.1,
  assay = NULL,
  max.genes = 2000,
  test.use = "LR",
  p.val.cutoff = 0.05,
  cols = NULL,
  enrich.database = c("GO_Biological_Process_2018"),
  num.pathway = 25,
  return.gene.list = FALSE
)
dev.off()

tiff(paste(output.path,"pathwaysMPP3s.tiff",sep = ""), width = 12, height = 6, units = 'in', res = 300)
DEenrichRPlot(
  MPP3,
  ident.1 = "PyMT",
  balanced = FALSE,
  logfc.threshold = 0.1,
  assay = NULL,
  max.genes = 2000,
  test.use = "LR",
  p.val.cutoff = 0.05,
  cols = NULL,
  enrich.database = c("GO_Biological_Process_2018"),
  num.pathway = 25,
  return.gene.list = FALSE
)
dev.off()

Step 16 Heterogeneity of the PyMT response

Now that we know the molecular differences between WT and PyMT LSKs we can ask if these molecular changes are distributed homogeneously throughout the progneitor compartment or are they confined to specific subcompartments. To do this we generate a PYMT response signature with genes that are upregulated in PyMT relative to WT and then calculate the mean expression of these gene-set in single cells.

Idents(lsks) <- lsks@meta.data$condition
markers <-  FindMarkers(lsks,ident.1 = "PyMT",logfc.threshold = 0.1,
                        test.use = "LR",min.cells.feature = 100,only.pos = TRUE)

#filter the signature to take only genes that are significantly enriched in PyMT
markers <- markers[markers$p_val_adj < 0.05 & markers$avg_log2FC > 0,]

#create a composite score for all of these genes and overlay expression onto our UMAP visualisation
lsks <- AddModuleScore(lsks, features = list(rownames(markers)),name = "PyMT")

tiff(paste(output.path,"PyMT_responders.tiff",sep = ""),
     width = 6, height = 6, units = 'in', res = 300)
FeaturePlot(lsks, features = "PyMT1", min.cutoff = "q20", max.cutoff = "q97", 
            pt.size = 1, cols = c("grey10","green"))
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"PyMT_responders_VlnPlot.tiff",sep = ""),
     width = 5, height = 4, units = 'in', res = 300)
Idents(lsks) <- lsks@meta.data$seurat_clusters
VlnPlot(lsks, features = "PyMT1",cols = cbPalette,pt.size = 0)
dev.off()
## quartz_off_screen 
##                 2

Step 17 Focused analysis of HSCs

Based on our UMAP visualisation there seems to some changes occuring in the HSC compartment, lets do a more focused analysis to investigate this further

Idents(lsks) <- lsks@meta.data$seurat_clusters
HSCs <- subset(lsks, idents = c("0","1"))

#generate a volcano plot to visualise DEGs
volcanoPlotHSC <- function(res,gene.list){
  res$gene <- rownames(res)
  
  res$sign <- 0
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC > 0.1)] <- 2
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC < -0.1)] <- 1
  res$p_val_adj[res$p_val_adj < 1e-150] <- 1e-150
  
  p <- ggplot(data=res, aes(x=avg_log2FC, y=-log10(p_val_adj), colour=as.factor(sign))) + geom_point( size=1) +
    scale_color_manual(name="", values=c("2" = rgb(67/255,138/255,201/255,1),"1"="#999998", "0"=rgb(220/255,220/255, 220/255,0.2))) +  
    theme(legend.position = "none") + xlim(-1.6,1.6) + 
    xlab("log2 fold change") + ylab("-log10 pvalue") + 
    geom_vline(xintercept=c(-0.1, 0.1), linetype=2) + 
    geom_hline(yintercept=-log10(0.05), linetype=2) 
  
  
  
  p <- p  + theme(
    text = element_text(size = 20),
    # Remove panel border
    panel.border = element_blank(),  
    # Remove panel grid lines
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Remove panel background
    panel.background = element_blank(),
    # Add axis line
    axis.line = element_line(colour = "grey")
  ) +geom_text_repel(data=res[gene.list,],aes(label=gene),
                     size = 4,force = 2,force_pull = 5,color = "black",
                     min.segment.length = unit(0, 'lines'), 
                     nudge_y = .2,max.overlaps = 15)
  
  
  
  return(p)
  
}



tiff(paste(output.path,"HSC_volcano.tiff",sep = ""), width = 8, height = 6, units = 'in', res = 300)
Idents(HSCs) <- HSCs@meta.data$condition
hsc.markers <-  FindMarkers(HSCs,ident.1 = "PyMT",logfc.threshold = 0.001,test.use = "LR",min.cells.feature = 100)

#need to see which genes are significant now
gene.list2 <- gene.list[hsc.markers[gene.list,]$p_val_adj < 0.05]


gene.list2 <- c('Jund','Junb','Ifitm1','Cebpb','Elane',
                'Mpo','Ifitm2','Ifitm3','Pf4','Car2','Meis1','Kit',
              'Cd24a','Mllt3','Pbx1','Gata2','Mecom','Ndufa12',
                'Atp5o')

volcanoPlotHSC(hsc.markers,gene.list2)
dev.off()
## quartz_off_screen 
##                 2
write.csv(hsc.markers,paste(output.path,"HSC_PyMTvsWT.csv",sep = ""))

Step 18 Focused analysis of MPP3s

Based on our UMAP visualisation there seems to some changes occuring in the HSC compartment, lets do a more focused analysis to investigate this further

Idents(lsks) <- lsks@meta.data$seurat_clusters
MPP3 <- subset(lsks, idents = c("4"))


Idents(MPP3) <-MPP3@meta.data$condition
markers <-  FindMarkers(MPP3,ident.1 = "PyMT",logfc.threshold = 0.001,test.use = "LR",min.cells.feature = 100)


#generate a volcano plot to visualise DEGs
volcanoPlotMPP3 <- function(res,gene.list){
  res$gene <- rownames(res)
  
  res$sign <- 0
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC > 0.1)] <- 2
  res$sign[which(res$p_val_adj < 0.05 & res$avg_log2FC < -0.1)] <- 1
  res$p_val_adj[res$p_val_adj == 0] <- 1e-312
  
  p <- ggplot(data=res, aes(x=avg_log2FC, y=-log10(p_val_adj), colour=as.factor(sign))) + geom_point( size=1) +
    scale_color_manual(name="", values=c("2" = rgb(67/255,138/255,201/255,1),"1"="#999998", "0"=rgb(220/255,220/255, 220/255,0.2))) +  
    theme(legend.position = "none") + xlim(-1.6,1.6) + 
    xlab("log2 fold change") + ylab("-log10 pvalue") + 
    geom_vline(xintercept=c(-0.1, 0.1), linetype=2) + 
    geom_hline(yintercept=-log10(0.05), linetype=2) 
  
  
  
  p <- p  + theme(
    text = element_text(size = 20),
    # Remove panel border
    panel.border = element_blank(),  
    # Remove panel grid lines
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Remove panel background
    panel.background = element_blank(),
    # Add axis line
    axis.line = element_line(colour = "grey")
  ) +geom_text_repel(data=res[gene.list,],aes(label=gene),
                     size = 4,force = 2,force_pull = 5,color = "black",
                     min.segment.length = unit(0, 'lines'), 
                     nudge_y = .2,max.overlaps = 25)
  
  
  
  return(p)
  
}


gene.list2 <- gene.list[markers[gene.list,]$p_val_adj < 0.05]


tiff(paste(output.path,"MPP3Volcano.tiff",sep = ""), width = 8, height = 6, units = 'in', res = 300)
volcanoPlotMPP3(markers,gene.list2)
dev.off()
## quartz_off_screen 
##                 2
write.csv(markers,paste(output.path,"MPP3_PyMTvsWT.csv",sep = ""))

Step 19 Focused analysis of Cycling cells

We also saw changes in the cycling compartment so lets check that also.

tiff(paste(output.path,"cyclingUMAPLSKS.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
DimPlot(lsks, group.by = "phases",cols = cbPalette,pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
cell.subset <- colnames(lsks)[lsks@meta.data$phases != "G1"]
cycling <- subset(lsks, cells = cell.subset)
Idents(cycling) <- cycling@meta.data$condition
DimPlot(cycling)

cycling <- FindVariableFeatures(cycling, selection.method = "vst", nfeatures = 5000) # was 1500
cycling <- RunPCA(object = cycling, verbose = FALSE,npcs = 50)
ElbowPlot(cycling,ndims = 50)

cycling <- RunUMAP(object = cycling, dims = 1:10, verbose = FALSE,reduction = "pca")

tiff(paste(output.path,"cycling.tiff",sep = ""), width = 8, 
     height = 6, units = 'in', res = 300)
DimPlot(cycling, group.by = "condition",cols = c(cbPalette2),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
cycling <- FindNeighbors(object = cycling, 
                                  dims = 1:10,
                                  reduction = "pca",
                                  verbose = FALSE,force.recalc = TRUE)


for(i in seq(from=0.1, to=1.0, by=0.1)){
    cycling<- FindClusters(object = cycling, 
    resolution = i,verbose = FALSE)
  }


clustree(cycling)

pymt.cells <- colnames(cycling)[cycling@meta.data$condition == "PyMT"]
wt.cells <- colnames(cycling)[cycling@meta.data$condition == "WT"]
cycling.pymt <- subset(cycling, cells = pymt.cells)
cycling.wt <- subset(cycling, cells = wt.cells)

#here is another density plot method kindly provided by wilfrid
tiff(paste(output.path,"cycling_density_WT.tiff",sep = ""), 
     width = 8, height = 6, units = 'in', res = 300)
createDensityPlot(cycling.wt)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cycling_density_PyMT.tiff",sep = ""),
     width = 8, height = 6, units = 'in', res = 300)
createDensityPlot(cycling.pymt)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cyclingMPP1.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(cycling, feature = "Pia_MPP11", min.cutoff = "q3", 
            max.cutoff = "q97",
            cols = c("grey10","green"),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cyclingMPP2.tiff",sep = ""),
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(cycling, feature = "Pia_MPP21", min.cutoff = "q20", 
            max.cutoff = "q97",
            cols = c("grey10","green"),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cyclingMPP3.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(cycling, feature = "Pia_MPP31", min.cutoff = "q20",
            max.cutoff = "q97",
            cols = c("grey10","green"),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cyclingMPP4.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(cycling, feature = "Pia_MPP41", min.cutoff = "q3", 
            max.cutoff = "q97",
            cols = c("grey10","green"),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
tiff(paste(output.path,"cyclingMPP5.tiff",sep = ""), 
     width = 4, height = 4, units = 'in', res = 300)
FeaturePlot(cycling, feature = "Pia_MPP51", min.cutoff = "q3",
            max.cutoff = "q97",
            cols = c("grey10","green"),pt.size = 1)
dev.off()
## quartz_off_screen 
##                 2
save(cycling,file = "cycling.Rda")

Step 18 Ternary plots to assess biased gene-expression patterns

We also saw changes in the cycling compartment so lets check that also.

#do a ternary plot of lineage potential, this could be cool for my scRNAbarcoding data. 

lsks.downsampled <- subset(lsks, cells = c(colnames(wt), colnames(pymt.downsampled)))
Idents(cycling) <- cycling@meta.data$condition

wt.cycling <- subset(cycling, idents = 'WT')
pymt.cycling  <- subset(cycling, idents = 'PyMT')


makeTernaryPlot(wt.cycling, "grey50")

makeTernaryPlot(pymt.cycling, "dark blue")

Step 19 Model differentiation trajectories using diffusion maps.

Given the limitations of our clustering analysis we model the data as a differentiation trajectory, using a diffusion map approach as implemented in the R package destiny. Diffusion maps are spectral method for non-linear dimension reduction introduced by Coifman et al. (2005). Diffusion maps are based on a distance metric (diffusion distance) which is conceptually relevant to how differentiating cells follow noisy diffusion-like dynamics, moving from a pluripotent state towards more differentiated states. The R package destiny implements the formulation of diffusion maps presented in Haghverdi et al.(2015) which is especially suited for analyzing single-cell gene expression data from time-course experiments. It implicitly arranges cells along their developmental path, with bifurcations where differentiation events occur.

We initially perform this analysis on WT and PyMT independently, but found that the major lineage trajectories erythroid, myeloid and lymphoid were observed in both cases albeit in different diffusion components. This predicts that the major haematopoietic structure was observed in both settings but that there is evidence of a molecular bias. Given that we found no obvious new trajectories that were exclusive to PyMT mice we performed a diffusion map analysis across the entire dataset, an approach that would allow us to better compare lineage bias between WT and PyMT mice.

Idents(lsks) <- lsks@meta.data$condition
DefaultAssay(object = lsks) <- "RNA" #was SCT for some reason!
lsks <- FindVariableFeatures(lsks, selection.method = "vst", nfeatures = 3000)

subset.wt <- subset(lsks, idents = c("WT"))
subset.pymt <- pymt.downsampled


lsks.downsampled <- subset(lsks, cells = c(colnames(subset.wt), colnames(subset.pymt)))

lsks.regressed <- ScaleData(lsks.downsampled, vars.to.regress = "phases")
## Regressing out phases
## Centering and scaling data matrix
df <- lsks.regressed@assays$RNA@scale.data[lsks.regressed@assays$RNA@var.features,]
df <- t(df)
sigmas <- find_sigmas(df, verbose = FALSE)

dm.combined <- DiffusionMap(df,sigma = optimal_sigma(sigmas))
## Warning in DiffusionMap(df, sigma = optimal_sigma(sigmas)): You have 3000 genes.
## Consider passing e.g. n_pcs = 50 to speed up computation.
dmap.combined <- eigenvectors(dm.combined)[, 1:20]

rownames(dmap.combined) <- colnames(lsks.downsampled)
lsks.downsampled[["dmap_nocc"]] <- CreateDimReducObject(embeddings = dmap.combined, key = "DMAPNOCC_", assay = 
                                              DefaultAssay(lsks.downsampled))



subset.wt <- subset(lsks.downsampled, idents = c("WT"))
subset.pymt <- subset(lsks.downsampled, idents= c("PyMT"))


DimPlot(object = lsks.downsampled, label = F,reduction = "dmap_nocc", pt.size = 1,dims = c(1,2))

FeaturePlot(lsks.downsampled,c("Elane"),reduction = "dmap_nocc",dims = c(1,2), min.cutoff = "q5", 
            max.cutoff = "q95", cols = c("grey90","blue")) + ylim(-0.042,0.032) + xlim(-0.02 ,0.06) + NoAxes()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 287 rows containing missing values (geom_point).

DimPlot(subset.wt,reduction = "dmap_nocc" , cols = c("#999998")) + NoAxes() + NoLegend()

DimPlot(subset.pymt,reduction = "dmap_nocc" , cols = c("#0F7FFE")) + NoAxes() + NoLegend()

References:

  1. Butler, Andrew et al. Integrating Single-Cell Transcriptomic Data across Different Conditions, Technologies, and Species. Nature Biotechnology 36, no. 5 (May 2018): 411???20. https://doi.org/10.1038/nbt.4096.

  2. Fan, Jean et al. Characterizing Transcriptional Heterogeneity through Pathway and Gene Set Overdispersion Analysis. Nature Methods 13, no. 3 (March 2016): 241???44. https://doi.org/10.1038/nmeth.3734.

  3. Kharchenko, Peter V., Lev Silberstein, and David T. Scadden. Bayesian Approach to Single-Cell Differential Expression Analysis. Nature Methods 11, no. 7 (July 2014): 740???42. https://doi.org/10.1038/nmeth.2967.

  4. Scialdone, Antonio et al. Computational Assignment of Cell-Cycle Stage from Single-Cell Transcriptome Data. Methods (San Diego, Calif.) 85 (September 1, 2015): 54???61. https://doi.org/10.1016/j.ymeth.2015.06.021.

  5. Lun A, Risso D (2018). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.4.0.

  6. Yang, Jennifer et al. Single Cell Transcriptomics Reveals Unanticipated Features of Early Hematopoietic Precursors. Nucleic Acids Research 45, no. 3 (17 2017): 1281???96. https://doi.org/10.1093/nar/gkw1214.

  7. Giladi, Amir et al. Single-Cell Characterization of Haematopoietic Progenitors and Their Trajectories in Homeostasis and Perturbed Haematopoiesis. Nature Cell Biology 20, no. 7 (July 2018): 836???46. https://doi.org/10.1038/s41556-018-0121-4.

  8. AlJanahi, Aisha A., Mark Danielsen, and Cynthia E. Dunbar. ‘An Introduction to the Analysis of Single-Cell RNA-Sequencing Data’. Molecular Therapy - Methods & Clinical Development 10 (21 September 2018): 189–96. https://doi.org/10.1016/j.omtm.2018.07.003.